Abstract

The deep Boltzmann machine (DBM) has been an important development in the quest for powerful
"deep" probabilistic models. To date, simultaneous or joint training of all layers of the DBM has
been largely unsuccessful with existing training methods. We introduce a simple regularization
scheme that encourages the weight vectors associated with each hidden unit to have similar norms.
We demonstrate that this regularization can be easily combined with standard stochastic maximum
likelihood to yield an effective training strategy for the simultaneous training of all layers of
the deep Boltzmann machine.



1 Introduction



Since its introduction by Salakhutdinov and Hinton [7], the deep Boltzmann machine (DBM) has
been one of the most ambitious attempts to build a probabilistic model with many layers of latent or
hidden variables. The DBM shares some characteristics with the earlier deep belief network (DBN)
[3]. Both models may be viewed as multi-layer or "deep" extensions of the popular restricted
Boltzmann machine (RBM). However, unlike the DBN, commonly used DBM approximate inference
schemes implement true feedback mechanisms where the inferred activations of high-level units can
influence the activations of lower-level units. Thus alternative interpretations of an input are able to
compete at all levels of the model simultaneously. This sort of sophistication in inference has the
potential to lead to more robust and globally coherent inferences, which are likely to translate to
better performance when the model is employed in tasks such as classification.



Despite its success, the DBM has not yet entirely fulfilled this potential and the DBN remains the 
more popular model paradigm for deep probabilistic models. One potential explanation of why 
this is so could be tied to the difficulties one encounters when attempting to estimate the model 
parameters from data. Straightforward applications of gradient-based methods such as stochastic
maximum likelihood (SML) [10] (also known as persistent contrastive divergence [9]) appear to fall
into poor local minima, and fail to adequately explore the space of model parameters.

One solution, as presented in [7], is based on a greedy layer-wise pretraining strategy which appears
to overcome these poor local minima. Each layer is trained as an RBM with the latent activations
of the layer below as its input. These layers are then recombined by rescaling the weights to account 
for the doubling of inputs into each intermediate layer. While this procedure seems to work well, 
it involves many steps and is more complicated than would be ideal. Also, more importantly, the 
necessity of layer-wise pretraining makes it very difficult for the organization of upper layers to 
influence the topological organization of lower layers. For example, in cases where we would like 
to learn a DBM that is not fully connected between each layer, it would potentially be very desirable 
for the global connectivity pattern (e.g. local receptive fields) to influence the pattern of learned
filters at all layers. Layer-wise pretraining precludes this possibility. If, on the other hand, one could 
jointly train all layers of a deep Boltzmann machine simultaneously, then the pattern of activation of 
the upper layer units would have an opportunity to influence the weights trained at the lower layers 
via their effect on the activations of lower-layer units.



In this paper, we describe a simple scheme for joint training of all layers of a deep Boltzmann 
machine. Our strategy is based on the observation that the poor local minima in which SML falls 
are characterized by high variance in the norms of the weight vectors, particularly those between 
the data and the first layer of hidden units. Our solution is to simply add a regularization term to 
the standard maximum likelihood training criterion that penalizes large differences in the norms of 
the weight vectors, both within a layer and across neighbouring layers. Our regularization is based 
on the intuition that successful models are those where all units contribute roughly equally to the 
representation of the data. This is a common principle that has previously been applied in the form 
of sparsity penalties for RBM and DBN training [5], where all hidden units are encouraged to be
active over an equal proportion of the data. Our application of this principle is somewhat different:
we do not explicitly control the activation of the hidden units; instead, we encourage equal influence
of each active hidden unit. We demonstrate, with experiments, both the failure of standard SML
training for DBMs and the effectiveness of our regularized-SML DBM training strategy. 

2 Boltzmann Machines 

A Boltzmann machine is defined as a network of symmetrically-coupled binary stochastic units 
(random variables). These stochastic units can be divided into two groups: (1) the visible units 
$v \in \{0, 1\}^D$ that represent the data, and (2) the hidden units $h \in \{0, 1\}^N$ that mediate
dependencies between the visible units through their mutual interactions. The pattern of interaction
is specified through the energy function:

$$E_{BM}(v, h; \theta) = -\frac{1}{2} v^T U v - \frac{1}{2} h^T V h - v^T W h - b^T v - d^T h, \tag{1}$$

where $\theta = \{U, V, W, b, d\}$ are the model parameters which respectively encode the
visible-to-visible interactions, the hidden-to-hidden interactions, the visible-to-hidden interactions,
the visible self-connections, and the hidden self-connections (also known as biases). To avoid
over-parametrization, the diagonals of $U$ and $V$ are set to zero.

The energy function specifies the probability distribution over the joint space $[v, h]$, via the
Boltzmann distribution:

$$P(v, h) = \frac{1}{Z(\theta)} \exp\left(-E_{BM}(v, h; \theta)\right), \tag{2}$$

where the partition function $Z(\theta)$ ensures that the distribution normalizes:

$$Z(\theta) = \sum_{v_1=0}^{1} \cdots \sum_{v_D=0}^{1} \sum_{h_1=0}^{1} \cdots \sum_{h_N=0}^{1} \exp\left(-E_{BM}(v, h; \theta)\right). \tag{3}$$
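As an illustration of Eqs. 1-3, the following is a minimal sketch that evaluates the energy and computes the partition function by brute-force enumeration. It is only feasible for toy sizes, since the loop visits all $2^{D+N}$ joint states. The function names, the sizes `D = 3`, `N = 2` and the random parameter values are assumptions made for this example, not part of the original work.

```python
import itertools
import numpy as np

def bm_energy(v, h, U, V, W, b, d):
    """Energy of Eq. 1 for binary vectors v (length D) and h (length N)."""
    return (-0.5 * v @ U @ v - 0.5 * h @ V @ h
            - v @ W @ h - b @ v - d @ h)

def bm_partition_function(U, V, W, b, d):
    """Partition function of Eq. 3 by enumerating all 2^(D+N) joint states."""
    D, N = W.shape
    Z = 0.0
    for v_bits in itertools.product([0.0, 1.0], repeat=D):
        for h_bits in itertools.product([0.0, 1.0], repeat=N):
            v, h = np.array(v_bits), np.array(h_bits)
            Z += np.exp(-bm_energy(v, h, U, V, W, b, d))
    return Z

def random_symmetric_zero_diag(rng, n, scale=0.1):
    """Symmetric coupling matrix with zero diagonal, as required for U and V."""
    A = rng.normal(scale=scale, size=(n, n))
    A = 0.5 * (A + A.T)
    np.fill_diagonal(A, 0.0)
    return A

# Toy model (hypothetical sizes and values, for illustration only).
rng = np.random.default_rng(0)
D, N = 3, 2
U = random_symmetric_zero_diag(rng, D)
V = random_symmetric_zero_diag(rng, N)
W = rng.normal(scale=0.1, size=(D, N))
b = rng.normal(scale=0.1, size=D)
d = rng.normal(scale=0.1, size=N)

Z = bm_partition_function(U, V, W, b, d)

# Sum the probabilities of Eq. 2 over every joint state; this should equal 1.
total = sum(
    np.exp(-bm_energy(np.array(v), np.array(h), U, V, W, b, d)) / Z
    for v in itertools.product([0.0, 1.0], repeat=D)
    for h in itertools.product([0.0, 1.0], repeat=N)
)
```

With all parameters set to zero every state has energy 0, so the same function returns $2^{D+N}$, which is a convenient sanity check.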

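This joint probability distribution gives rise to the set of conditional distributions of the form:

$$P(h_i = 1 \mid v, h_{\setminus i}) = \mathrm{sigmoid}\Big( \sum_j W_{ji} v_j + \sum_{i' \neq i} V_{ii'} h_{i'} + d_i \Big) \tag{4}$$

$$P(v_j = 1 \mid h, v_{\setminus j}) = \mathrm{sigmoid}\Big( \sum_i W_{ji} h_i + \sum_{j' \neq j} U_{jj'} v_{j'} + b_j \Big). \tag{5}$$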

In general, inference in the Boltzmann machine is intractable. For example, computing the
conditional probability of $h_i$ given the visibles, $P(h_i \mid v)$, requires marginalizing over the
rest of the hiddens, which implies evaluating a sum with $2^{N-1}$ terms:

$$P(h_i \mid v) = \sum_{h_1=0}^{1} \cdots \sum_{h_{i-1}=0}^{1} \sum_{h_{i+1}=0}^{1} \cdots \sum_{h_N=0}^{1} P(h \mid v). \tag{6}$$

However, with some judicious choices in the pattern of interactions between the visible and hidden
units, more tractable subsets of the model family are possible.
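To make the cost of Eq. 6 concrete, here is a small sketch that computes $P(h_i = 1 \mid v)$ by explicitly summing over the $2^{N-1}$ configurations of the other hidden units. The function name and argument layout are assumptions for illustration. Terms of the energy that depend only on $v$ are omitted because they cancel when normalizing.

```python
import itertools
import numpy as np

def posterior_marginal(i, v, V, W, d):
    """P(h_i = 1 | v) by summing over the other N-1 hidden units (Eq. 6).

    V: (N, N) symmetric hidden-to-hidden weights with zero diagonal.
    W: (D, N) visible-to-hidden weights, d: (N,) hidden biases.
    """
    N = W.shape[1]
    mass = {0: 0.0, 1: 0.0}  # unnormalized probability mass for h_i = 0 and 1
    for rest in itertools.product([0.0, 1.0], repeat=N - 1):  # 2^(N-1) terms
        for hi in (0, 1):
            # Re-insert h_i at position i among the enumerated units.
            h = np.array(rest[:i] + (float(hi),) + rest[i:])
            # Negative energy of Eq. 1, keeping only the terms involving h.
            neg_energy = 0.5 * h @ V @ h + v @ W @ h + d @ h
            mass[hi] += np.exp(neg_energy)
    return mass[1] / (mass[0] + mass[1])
```

When `V` is all zeros the hidden units decouple and the result reduces to a single sigmoid of the input to unit `i`, which is the situation exploited by the restricted model.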



2.1 Restricted Boltzmann Machines 

The restricted Boltzmann machine (RBM) is likely the most popular subset of Boltzmann machines.
It is defined by restricting the interactions in the Boltzmann energy function of Eq. 1 to only
those between $h$ and $v$, i.e. for $E_{RBM}$, $U = 0$ and $V = 0$. As such, the RBM can be said to
form a bipartite graph with the visibles and the hiddens forming two layers of vertices in the graph.
With this restriction, the RBM possesses the useful property that the conditional distribution over
the hidden units factorizes given the visibles:

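$$P(h \mid v) = \prod_i P(h_i \mid v), \qquad P(h_i = 1 \mid v) = \mathrm{sigmoid}\Big( \sum_j W_{ji} v_j + d_i \Big). \tag{7}$$

Likewise, the conditional distribution over the visible units given the hiddens also factorizes:

$$P(v \mid h) = \prod_j P(v_j \mid h), \qquad P(v_j = 1 \mid h) = \mathrm{sigmoid}\Big( \sum_i W_{ji} h_i + b_j \Big). \tag{8}$$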

This conditional factorization property of the RBM immediately implies that most inferences we
would like to make are readily tractable. For example, the conditional independence of the hiddens
implies that posterior marginals of the form $P(h_i \mid v)$ are immediately available.
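The factorized conditionals of Eqs. 7-8 translate directly into vectorized code, and they also make block Gibbs sampling straightforward. The sketch below is illustrative only: the function names and the convention that `W` has shape `(D, N)` are assumptions chosen to match the notation above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, d):
    """Eq. 7: vector of P(h_i = 1 | v) for all hidden units at once."""
    return sigmoid(v @ W + d)

def p_v_given_h(h, W, b):
    """Eq. 8: vector of P(v_j = 1 | h) for all visible units at once."""
    return sigmoid(h @ W.T + b)

def gibbs_step(v, W, b, d, rng):
    """One block Gibbs sweep: sample all h given v, then all v given h."""
    h = (rng.random(W.shape[1]) < p_h_given_v(v, W, d)).astype(float)
    v_new = (rng.random(W.shape[0]) < p_v_given_h(h, W, b)).astype(float)
    return v_new, h
```

Because each layer is conditionally independent given the other, a whole layer can be sampled in one vectorized operation rather than one unit at a time.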

Importantly, the tractability of the RBM does not extend to its partition function, which still involves
sums with an exponential number of terms. It does imply, however, that we can limit the number of
terms to $\min\{2^D, 2^N\}$. Usually this is still an unmanageable number of terms and therefore we
must resort to approximate methods to deal with its computation.
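The reduction to $\min\{2^D, 2^N\}$ terms follows from summing out one layer analytically: once the states of one layer are fixed, each unit of the other layer contributes an independent factor $(1 + e^{\text{input}})$. The sketch below enumerates only the smaller layer and includes a naive reference that enumerates all joint states; both function names are hypothetical and the code is meant for toy sizes only.

```python
import itertools
import numpy as np

def rbm_partition_function(W, b, d):
    """Exact RBM partition function, enumerating only the smaller layer.

    W: (D, N) weights, b: (D,) visible biases, d: (N,) hidden biases.
    """
    D, N = W.shape
    if N > D:
        # Swap the roles of the layers so the enumerated layer is the smaller one.
        W, b, d = W.T, d, b
        D, N = N, D
    Z = 0.0
    for bits in itertools.product([0.0, 1.0], repeat=N):
        h = np.array(bits)
        # The other layer is summed analytically: one factor per unit.
        Z += np.exp(d @ h) * np.prod(1.0 + np.exp(b + W @ h))
    return Z

def rbm_partition_function_naive(W, b, d):
    """Reference implementation enumerating all 2^(D+N) joint states."""
    D, N = W.shape
    return sum(
        np.exp(np.array(v) @ W @ np.array(h) + b @ np.array(v) + d @ np.array(h))
        for v in itertools.product([0.0, 1.0], repeat=D)
        for h in itertools.product([0.0, 1.0], repeat=N)
    )
```

Even with this reduction, a layer of a few hundred units is far beyond enumeration, which is why approximate methods are needed in practice.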

Learning in the RBM is also rendered much more tractable in comparison to the general Boltzmann
machine. Typically, learning involves finding a set of model parameters that approximately
maximizes the log likelihood of a training dataset:

$$\sum_t \log P(v_{:,t}; \theta) = \sum_t \log \sum_{h_{1,t}=0}^{1} \cdots \sum_{h_{N,t}=0}^{1} P(v_{:,t}, h_{:,t}; \theta). \tag{9}$$

This can be accomplished via gradient ascent. The gradient of the log likelihood of the data for the 
RBM is given by: 

$$\frac{\partial}{\partial \theta_i} \sum_t \log P(v_{:,t}) = -\sum_t \left\langle \frac{\partial}{\partial \theta_i} E_{RBM}(v_{:,t}, h_{:,t}) \right\rangle_{P(h_{:,t} \mid v_{:,t})} + \sum_t \left\langle \frac{\partial}{\partial \theta_i} E_{RBM}(v, h) \right\rangle_{P(v, h)}$$

where we have the expectations with respect to $P(h_{:,t} \mid v_{:,t})$ in the "clamped" condition, and
over the full joint $P(v, h)$ in the "unclamped" condition. In training, we follow the stochastic
maximum likelihood (SML) algorithm (also known as persistent contrastive divergence or PCD) [10, 9],
i.e., performing only one or a few updates of an MCMC chain between each parameter update.
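A single SML (PCD) parameter update for an RBM might look like the sketch below. It is a hedged illustration, not the authors' implementation: the function names, the use of mean hidden activations for the sufficient statistics, and the choice of one persistent chain per minibatch row are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def sml_update(v_data, v_chain, W, b, d, lr, rng, k=1):
    """One SML/PCD parameter update for an RBM.

    v_data:  (B, D) minibatch, used for the clamped phase.
    v_chain: (B, D) persistent chain states, used for the unclamped phase.
    W: (D, N), b: (D,), d: (N,). Returns updated parameters and chains.
    """
    # Clamped phase: expectations under P(h | v_data).
    h_data = sigmoid(v_data @ W + d)

    # Unclamped phase: continue the persistent chains for k Gibbs sweeps.
    for _ in range(k):
        h_chain = sample_bernoulli(sigmoid(v_chain @ W + d), rng)
        v_chain = sample_bernoulli(sigmoid(h_chain @ W.T + b), rng)
    h_model = sigmoid(v_chain @ W + d)

    B = v_data.shape[0]
    # Log-likelihood gradient: clamped statistics minus unclamped statistics.
    W = W + lr * (v_data.T @ h_data - v_chain.T @ h_model) / B
    b = b + lr * (v_data - v_chain).mean(axis=0)
    d = d + lr * (h_data - h_model).mean(axis=0)
    return W, b, d, v_chain
```

The key difference from plain contrastive divergence is that `v_chain` is carried over between calls instead of being re-initialized at the data.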

3 Deep Boltzmann Machines 

The Deep Boltzmann Machine (DBM) is another particular subset of the Boltzmann machine family
of models where the units are again arranged in layers. However, unlike the RBM, the DBM
possesses multiple layers of hidden units, as illustrated in Figure 1. With respect to the Boltzmann
energy function of Eq. 1, the DBM corresponds to setting $U = 0$ and a sparse connectivity
structure in both $V$ and $W$. We can make the structure of the DBM more explicit by specifying its
energy function. For the 2-layer model it is given as:

$$E_{DBM}(v, h^{(1)}, h^{(2)}; \theta) = -v^T W h^{(1)} - h^{(1)T} V h^{(2)} - d^{(1)T} h^{(1)} - d^{(2)T} h^{(2)} - b^T v, \tag{10}$$

with $\theta = \{W, V, d^{(1)}, d^{(2)}, b\}$. The DBM can also be characterized as a bipartite graph
between two sets of vertices, formed by the units in odd and even-numbered layers (with $v := h^{(0)}$).

Figure 1: A graphical depiction of a 2-layer DBM. Undirected weights connect the visible layer to layer
$h^{(1)}$, and layer $h^{(1)}$ to layer $h^{(2)}$. Note that $v$ and $h^{(2)}$ are conditionally
independent given $h^{(1)}$.

3.1 Mean-field approximate inference 

A key point of departure from the RBM is that the posterior distribution over the hidden units
(given the visibles) is no longer tractable, due to the interactions between the hidden units.
Salakhutdinov and Hinton [7] resort to a mean-field approximation to the posterior. Specifically,
in the case of the 2-layer model, we wish to approximate $P(h^{(1)}, h^{(2)} \mid v)$ with the factored
distribution $Q_v(h^{(1)}, h^{(2)}) = \prod_j Q_v(h_j^{(1)}) \prod_i Q_v(h_i^{(2)})$, such that the KL
divergence $KL\left(Q_v(h^{(1)}, h^{(2)}) \,\|\, P(h^{(1)}, h^{(2)} \mid v)\right)$ is minimized or,
equivalently, that a lower bound to the log likelihood is maximized:



$$\log P(v) \geq \mathcal{L}(Q_v) = \sum_{h^{(1)}} \sum_{h^{(2)}} Q_v(h^{(1)}, h^{(2)}) \log \frac{P(v, h^{(1)}, h^{(2)})}{Q_v(h^{(1)}, h^{(2)})}. \tag{11}$$



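Maximizing this lower bound with respect to the mean-field distribution $Q_v(h^{(1)}, h^{(2)})$ yields
the following mean-field update equations, where $\hat{h}_i^{(l)}$ denotes the mean-field parameter
$Q_v(h_i^{(l)} = 1)$:

$$\hat{h}_i^{(1)} \leftarrow \mathrm{sigmoid}\Big( \sum_j W_{ji} v_j + \sum_k V_{ik} \hat{h}_k^{(2)} + d_i^{(1)} \Big) \tag{12}$$

$$\hat{h}_k^{(2)} \leftarrow \mathrm{sigmoid}\Big( \sum_i V_{ik} \hat{h}_i^{(1)} + d_k^{(2)} \Big). \tag{13}$$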



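Iterating Eqs. (12)-(13) until convergence yields the parameters used to estimate the "variational
positive phase" of Eq. 14:

$$\mathcal{L}(Q_v) = \mathbb{E}_{Q_v}\left[ \log P(v, h^{(1)}, h^{(2)}) - \log Q_v(h^{(1)}, h^{(2)}) \right]$$

$$\phantom{\mathcal{L}(Q_v)} = \mathbb{E}_{Q_v}\left[ -E_{DBM}(v, h^{(1)}, h^{(2)}) - \log Q_v(h^{(1)}, h^{(2)}) \right] - \log Z(\theta)$$

$$\frac{\partial \mathcal{L}(Q_v)}{\partial \theta} = -\mathbb{E}_{Q_v}\left[ \frac{\partial E_{DBM}(v, h^{(1)}, h^{(2)})}{\partial \theta} \right] + \mathbb{E}_{P}\left[ \frac{\partial E_{DBM}(v, h^{(1)}, h^{(2)})}{\partial \theta} \right]. \tag{14}$$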



Note that this variational learning procedure leaves the "negative phase" untouched. It can thus be 
estimated through SML or Contrastive Divergence [2] as in the RBM case. 
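The mean-field updates of Eqs. (12)-(13) can be sketched as a simple fixed-point iteration. In the code below the function name, the initialization of the mean-field parameters at 0.5, and the stopping rule (maximum absolute change below `tol`) are assumptions; the source only states that the updates are iterated until convergence.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W, V, d1, d2, max_iters=50, tol=1e-6):
    """Mean-field inference for a 2-layer DBM (Eqs. 12-13).

    v: (D,), W: (D, N1), V: (N1, N2), d1: (N1,), d2: (N2,).
    Returns the mean-field parameters (h1, h2) of the factored posterior.
    """
    h1 = np.full(W.shape[1], 0.5)  # assumed initialization
    h2 = np.full(V.shape[1], 0.5)  # assumed initialization
    for _ in range(max_iters):
        h1_old, h2_old = h1, h2
        # Eq. 12: layer 1 receives bottom-up input from v and top-down input from h2.
        h1 = sigmoid(v @ W + V @ h2 + d1)
        # Eq. 13: layer 2 receives bottom-up input from h1 only.
        h2 = sigmoid(h1 @ V + d2)
        change = max(np.abs(h1 - h1_old).max(), np.abs(h2 - h2_old).max())
        if change < tol:
            break
    return h1, h2
```

The top-down term `V @ h2` in the first update is what distinguishes DBM inference from a single bottom-up pass: the estimate for the first hidden layer depends on the current estimate for the second.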

3.2 Training Deep Boltzmann Machines

Despite the intractability of inference in the DBM, its training should not, in theory, be much more
complicated than that of the RBM. The major difference is that instead of maximizing the
likelihood directly, we choose parameters to maximize the lower bound on the likelihood given
in Eq. 11. The SML-based algorithm for maximizing this lower bound is as follows:



1. Clamp the visible units to a training example.

2. Iterate over Eqs. (12)-(13) until convergence.

3. Generate negative phase samples $v^-$, $h^{(1)-}$ and $h^{(2)-}$ through SML.

4. Compute $\partial \mathcal{L}(Q_v) / \partial \theta$ using the values obtained in steps 2-3.

5. Finally, update the model parameters with a step of gradient ascent.
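The five steps above can be sketched for a minibatch as follows. This is an illustrative outline under several assumptions that are not stated in the source: a fixed number of mean-field iterations stands in for a convergence test, the persistent chains are advanced by one block Gibbs sweep that alternates between the odd layer and the even layers, and the parameter container and function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def dbm_sml_step(v, chains, params, lr, rng, mf_iters=25):
    """One SML update of a 2-layer DBM on a minibatch v of shape (B, D).

    chains: tuple (v_neg, h1_neg, h2_neg) of persistent samples, B rows each.
    params: dict with W (D, N1), V (N1, N2), b (D,), d1 (N1,), d2 (N2,).
    """
    W, V, b, d1, d2 = (params[k] for k in ("W", "V", "b", "d1", "d2"))
    B = v.shape[0]

    # Steps 1-2: clamp v and iterate the mean-field updates (Eqs. 12-13).
    # A fixed iteration count is used here in place of a convergence test.
    h2 = np.full((B, V.shape[1]), 0.5)
    for _ in range(mf_iters):
        h1 = sigmoid(v @ W + h2 @ V.T + d1)
        h2 = sigmoid(h1 @ V + d2)

    # Step 3: advance the persistent chains by one block Gibbs sweep.
    v_neg, h1_neg, h2_neg = chains
    h1_neg = sample(sigmoid(v_neg @ W + h2_neg @ V.T + d1), rng)  # odd layer
    v_neg = sample(sigmoid(h1_neg @ W.T + b), rng)                # even layers
    h2_neg = sample(sigmoid(h1_neg @ V + d2), rng)

    # Step 4: gradient of the bound = positive minus negative statistics.
    grads = {
        "W": (v.T @ h1 - v_neg.T @ h1_neg) / B,
        "V": (h1.T @ h2 - h1_neg.T @ h2_neg) / B,
        "b": (v - v_neg).mean(axis=0),
        "d1": (h1 - h1_neg).mean(axis=0),
        "d2": (h2 - h2_neg).mean(axis=0),
    }

    # Step 5: gradient ascent on the lower bound.
    new_params = {k: params[k] + lr * grads[k] for k in params}
    return new_params, (v_neg, h1_neg, h2_neg)
```

Returning the updated chains alongside the parameters keeps the function free of hidden state, so the caller decides how the persistent samples are stored between updates.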

While the above procedure appears to be a simple extension of the highly effective SML scheme
for training RBMs, as we demonstrate in Sec. 5, this procedure seems vulnerable to falling into poor
local minima which leave many filters effectively dead (not significantly different from their random
initialization, with small norm).

The failure of the SML joint training strategy was noted by Salakhutdinov and Hinton [7]. As a
far more successful alternative, they proposed a greedy layer-wise training strategy. This procedure
consists of pre-training the layers of the DBM, in much the same way as the Deep Belief Network,
i.e. by stacking RBMs and training each layer to independently model the output of the previous
layer. A final joint "fine-tuning" is done following the above SML-based procedure.

4 Joint Training of a DBM 

While greedy layer-wise training has been shown to be reasonably successful, a means of jointly
training a DBM would be highly desirable. Not only would it be simpler, but it would open the
door to local receptive field learning and more general architectures where we would like the top- 
down pattern of connectivity to influence the learning of lower-level features. In this section we 
detail our simple proposal for a means to jointly train all layers of a DBM. 

Our strategy is based on the observation that standard SML-based joint training tends to produce high
variance in the norms of the weight vectors associated with each hidden unit, particularly across the
first-layer weight vectors that connect the visible units to the first hidden layer units.

Our proposal is thus to regularize the maximum likelihood objective, in order to encourage units to 
have similar norms (i) within a given layer and (ii) to a lesser extent, across neighbouring layers. 
We thus introduce an additional parameter $\mu^{(l)}$ for each layer of the DBM, representing the average
norm of weight vectors of the $l$-th layer. We then add a regularization term to our objective
$\mathcal{L}(Q_v)$, which penalizes deviations from this mean value using a squared-error penalty. The
$\mu^{(l)}$'s belonging to adjacent layers are further constrained to be close (again in the $\ell_2$
sense). This gives rise to the following regularized objective:

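$$\min \; -\mathcal{L}(Q_v) + \alpha^{(1)} \sum_{j} \left( \|W_{\cdot j}\|_2 - \mu^{(1)} \right)^2 + \alpha^{(2)} \sum_{i} \left( \|V_{\cdot i}\|_2 - \mu^{(2)} \right)^2 + \gamma \left( \mu^{(1)} - \mu^{(2)} \right)^2. \tag{15}$$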

One can think of this regularization term as a spring, which ensures that the system evolves jointly
from an initial random weight configuration (where the columns of $W$ have small norm) to a good
model of the input distribution $P(v)$, which undoubtedly requires weight vectors with larger norms.
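As a rough illustration of the penalty terms in Eq. 15, the sketch below computes the regularizer for a 2-layer DBM together with its gradients with respect to the weights and the layer-norm parameters. Several points are assumptions rather than details confirmed by the source: each hidden unit's weight vector is taken to be a column of `W` or `V`, the norm is the Euclidean norm, and the function name and the small constant `eps` are hypothetical. The default hyper-parameter values are those reported in the experiments.

```python
import numpy as np

def norm_penalty(W, V, mu1, mu2, alpha1=0.1, alpha2=0.1, gamma=1.0):
    """Penalty terms of Eq. 15 for a 2-layer DBM.

    W: (D, N1), V: (N1, N2); one weight vector per column (assumption).
    Returns the penalty and its gradients w.r.t. W, V, mu1 and mu2.
    """
    w_norms = np.linalg.norm(W, axis=0)  # (N1,) norms of first-layer filters
    v_norms = np.linalg.norm(V, axis=0)  # (N2,) norms of second-layer filters
    r1, r2 = w_norms - mu1, v_norms - mu2

    penalty = (alpha1 * np.sum(r1 ** 2) + alpha2 * np.sum(r2 ** 2)
               + gamma * (mu1 - mu2) ** 2)

    eps = 1e-12  # guards against division by zero for an all-zero column
    # d/dW_kj (||W_j|| - mu)^2 = 2 (||W_j|| - mu) W_kj / ||W_j||
    grad_W = 2.0 * alpha1 * r1 * W / (w_norms + eps)
    grad_V = 2.0 * alpha2 * r2 * V / (v_norms + eps)
    grad_mu1 = -2.0 * alpha1 * np.sum(r1) + 2.0 * gamma * (mu1 - mu2)
    grad_mu2 = -2.0 * alpha2 * np.sum(r2) - 2.0 * gamma * (mu1 - mu2)
    return penalty, grad_W, grad_V, grad_mu1, grad_mu2
```

In a training loop these gradients would be subtracted from the ascent direction on the bound (or, equivalently, added to the gradient of a cost being minimized). The gradient on each column points along the column itself, so the penalty rescales filters toward the layer's mean norm without rotating them.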

5 Experiments 

To evaluate the effect of our regularization term, we trained deep Boltzmann machines on the
pervasive MNIST dataset [4]. All our models were trained for $10^6$ updates, using minibatches of size
50, varying the learning rate in $\{10^{-2}, 10^{-3}, 10^{-4}\}$.

Direct Approach. Our first model was a 3-layer DBM with [500, 500, 1000] hidden units in the
first, second and third layers (respectively). The resulting filters are shown in Figure 2.

As we can see, many of the first-layer filters are not significantly different from their random
initialization, with very small norm. Based on the difference between these results and the successful
training of an RBM, we speculate that the top-down interactions prevent the first layer from learning
useful filters. We have confirmed that the high-frequency "noise" filters are actually the ones with
lowest norm. It seems as though early top-down input is influencing the activations of these hidden
units, perhaps directly through some form of suppression, or indirectly, by reinforcing the activation
of the subset of filters which become useful early on. While we plot filters for 3-layer networks, the
same results hold for networks of depth two.

Figure 2: Random subset of filters obtained by a 784-500-500-1000 DBM, after $10^6$ updates (trained jointly,
using the direct approach). The learning rate was set to $10^{-3}$ and the batch size to 50. Higher-level weights
are visualized by performing a linear combination of lower-level weights, but only picking the 20 most active
connections. Panels: (a) Layer 1, (b) Layer 2, (c) Layer 3.

Figure 3: Random subset of filters obtained by a 784-500-500 DBM, after $10^6$ updates. All layers were trained
jointly using the regularized cost of Eq. 15. The learning rate was set to $10^{-2}$ and the batch size to 50.
Panels: (a) Layer 1, (b) Layer 2.



Joint Training. Using our regularized objective, we train a 2-layer DBM with 500 hidden units in
each layer. We set the hyper-parameters of Eq. 15 as follows: $\alpha^{(1)} = 0.1$, $\alpha^{(2)} = 0.1$
and $\gamma = 1$. Furthermore, we dampen the learning rate on the parameters $\mu^{(1)}$ and $\mu^{(2)}$
by a factor of a thousand, in order for the layer norms to evolve more slowly. A random subset of
filters is shown in Figure 3.

The effect of our regularization term is drastic: the vast majority of first layer filters seem to train 
successfully and resemble the pen-stroke detectors characteristic of features learned on MNIST. 
While more difficult to interpret, the second layer filters are also clearly superior to the ones of
Figure 2, combining lower-level features (pen strokes) into more global digit-like objects.



6 Discussion 



We have introduced a simple regularization scheme that appears to prevent SML from falling into a 
poor local minimum of our objective function. The regularizer encourages units to learn filters which 
have similar norms, both within and across adjacent layers. We have empirically demonstrated the
failure of standard SML joint training of all layers of a deep Boltzmann machine, which leaves many
filters in the lower layer of a 2-layer DBM with very small norm and, consequently, little contribution
to modeling the data. We have also empirically demonstrated the success of our regularization scheme
in simultaneously training both layers of a 2-layer DBM. While we have not shown it here, our 
scheme also appears to work well for DBMs with more than 2 layers. In future work, we would like 
to determine how many layers can be learnt simultaneously using similar basic norm-regularization
schemes.